Sains Malaysiana 52(9)(2023): 2725-2732
http://doi.org/10.17576/jsm-2023-5209-20
Statistical Methods for Finding Outliers in Multivariate Data using a Boxplot and Multiple Linear Regression
THEERAPHAT THANWISET & WUTTICHAI SRISODAPHOL*
Department of Statistics, Khon Kaen University, 40002 Khon Kaen, Thailand
Received: 1 December 2022 / Accepted: 15 August 2023
Abstract
The objective of this study was to propose a method for detecting outliers in multivariate data based on a boxplot and multiple linear regression. In the proposed method, the boxplot is first applied to each variable to split the data set into two parts: normal data (observations falling within the lower and upper fences of the boxplot) and data that may be outliers. The normal data are then used to fit a multiple linear regression model, and the maximum absolute residual from that fit is taken as the cut-off point. To evaluate the performance of the proposed method, a simulation study was conducted on multivariate normal data with and without contamination at various levels. The proposed method was compared with previous methods, namely the Mahalanobis distance and the Mahalanobis distance with robust estimators based on the minimum volume ellipsoid, minimum covariance determinant, and minimum vector variance methods. The results showed that the proposed method outperformed the compared methods at all contamination levels. When applied to real data, it was also able to identify outliers consistent with the actual data.
Keywords:
Boxplot; multivariate data; multiple linear regression; outlier
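The procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Tukey fence constant k = 1.5, ordinary least squares as the regression fit, and the maximum absolute residual of the filtered data as the cut-off are all assumptions made for the sketch.

```python
import numpy as np

def boxplot_fences(x, k=1.5):
    """Tukey boxplot fences for one variable: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def detect_outliers(X, y):
    """Flag observations as outliers via boxplot filtering + regression cutoff.

    Step 1: keep only rows lying within the boxplot fences of every variable
            (the "normal" data).
    Step 2: fit a multiple linear regression of y on X using the normal data.
    Step 3: take the maximum absolute residual of the normal data as the
            cut-off; flag any observation whose residual exceeds it.
    """
    data = np.column_stack([X, y])
    inside = np.ones(len(data), dtype=bool)
    for j in range(data.shape[1]):
        lo, hi = boxplot_fences(data[:, j])
        inside &= (data[:, j] >= lo) & (data[:, j] <= hi)

    # Fit OLS on the normal subset only (intercept column prepended)
    A_in = np.column_stack([np.ones(inside.sum()), X[inside]])
    beta, *_ = np.linalg.lstsq(A_in, y[inside], rcond=None)
    cutoff = np.abs(y[inside] - A_in @ beta).max()  # max absolute residual

    # Score every observation against the model fitted on normal data
    A_all = np.column_stack([np.ones(len(y)), X])
    resid = np.abs(y - A_all @ beta)
    return resid > cutoff
```

Because the regression is fitted only on the boxplot-filtered rows, gross outliers cannot inflate the coefficient estimates or the residual cut-off, which is what distinguishes this scheme from applying a residual rule to a fit on the full data.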
REFERENCES
Aelst, S.V. & Rousseeuw, P. 2009. Minimum volume ellipsoid. WIREs Computational Statistics 1: 71-82.
Anscombe, F.J. & Guttman, I. 1960. Rejection of outliers. Technometrics 2(2): 123-147.
Belsley, D.A., Kuh, E. & Welsch, R.E. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons.
Cabana, E., Lillo, R.E. & Laniado, H. 2021. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Statistical Papers 62: 1583-1609.
Cook, R.D. 1977. Detection of influential observations in regression. Technometrics 19: 15-18.
Herdiani, E.T., Sari, P.P. & Sunusi, N. 2019. Detection of outliers in multivariate data using minimum vector variance method. Journal of Physics: Conference Series 1341(9): 092004.
Hoaglin, D.C. & Welsch, R.E. 1978. The hat matrix in regression and ANOVA. The American Statistician 32: 17-22.
Hubert, M. & Debruyne, M. 2010. Minimum covariance determinant. WIREs Computational Statistics 2: 36-43.
Lichtinghagen, R., Klawonn, F. & Hoffmann, G. 2020. UCI Machine Learning Repository. Irvine: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/HCV+data
Mahalanobis, P.C. 1936. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2(1): 49-55.
Montgomery, D.C., Peck, E.A. & Vining, G.G. 2012. Introduction to Linear Regression Analysis. 3rd ed. New York: John Wiley & Sons.
Tukey, J.W. 1977. Exploratory Data Analysis. Massachusetts: Addison Wesley.
*Corresponding author; email: wuttsr@kku.ac.th